Advanced error and result modeling in modern .NET systems

When systems become real, failure stops being a side concern.

In simple demo code, failure often looks like one thing: “something went wrong.” In production systems, especially WPF desktop systems connected to machines, cameras, PLCs, storage, and long-running workflows, that is not good enough. Some failures are expected. Some are exceptional. Some should stop everything immediately. Some should only produce a warning. Some should be shown to the operator in simple language. Some should only be logged for engineers. Some mean “retry later.” Some mean “your workflow logic is wrong.” These are very different things, and a good system should model them differently.

That is why advanced error handling is really about failure modeling, not only try/catch.


1. Big picture — why failure modeling matters as much as failure handling

A lot of engineers think about failure only at the point where code breaks. They think in terms of catch blocks, logs, message boxes, and retries. But the deeper question comes earlier: how do we represent failure in the first place?

That design choice shapes the whole system.

If every kind of failure becomes an exception, then callers cannot easily tell the difference between:

  • “recipe validation failed because the user entered an invalid threshold”
  • “machine rejected the command because it is not homed yet”
  • “camera disconnected unexpectedly”
  • “image save partially failed but inspection still completed”
  • “developer bug: impossible state reached”

These are not the same. They should not look the same in code. They should not be handled the same way in the UI. They should not be logged with the same severity. They should not trigger the same operational response.

In production systems, callers need a clear contract:

  • What can fail?
  • Is that failure expected or unexpected?
  • Is the caller supposed to handle it explicitly?
  • Can the workflow continue?
  • Should the operator intervene?
  • Should the system retry automatically?
  • Is the failure safe, unsafe, temporary, or fatal?

That is why failure modeling matters as much as failure handling. If the contract is vague, the code becomes vague. And vague code around failure becomes operational pain.

Real examples

A machine command execution service might expose this:

csharp
Task<bool> StartInspectionAsync(CancellationToken ct);

This tells the caller almost nothing.

Does false mean:

  • machine is disconnected?
  • machine is busy?
  • recipe is invalid?
  • safety interlock is active?
  • timeout?
  • SDK threw?
  • operation cancelled?
  • command rejected because state is wrong?

The caller now has to guess, or depend on side channels like logs, out parameters, global state, or message events. That is weak design.

A better API tells the truth:

csharp
Task<Result<StartInspectionOutcome, MachineCommandError>> StartInspectionAsync(
    InspectionRecipe recipe,
    CancellationToken ct);

Now the contract says: this operation has a success outcome, and expected command-related failures are modeled explicitly. That makes the system more honest.
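
The signature above relies on a two-parameter result type, which the .NET base class library does not ship. As a minimal sketch, assuming you write it yourself, it could look like the following. The names `Result<TValue, TError>`, `Ok`, `Fail`, and `Match` are illustrative assumptions, not an existing API.

```csharp
#nullable enable
using System;

// Hypothetical two-parameter result type (not part of the base class library).
public sealed class Result<TValue, TError>
{
    private readonly TValue? _value;
    private readonly TError? _error;

    private Result(bool isSuccess, TValue? value, TError? error)
    {
        IsSuccess = isSuccess;
        _value = value;
        _error = error;
    }

    public bool IsSuccess { get; }
    public bool IsFailure => !IsSuccess;

    // Reading the wrong side is a programming error, so it throws.
    public TValue Value => IsSuccess
        ? _value!
        : throw new InvalidOperationException("No value: the result is a failure.");

    public TError Error => IsFailure
        ? _error!
        : throw new InvalidOperationException("No error: the result is a success.");

    public static Result<TValue, TError> Ok(TValue value) => new(true, value, default);

    public static Result<TValue, TError> Fail(TError error) => new(false, default, error);

    // Makes the caller supply a handler for both branches.
    public TOut Match<TOut>(Func<TValue, TOut> onSuccess, Func<TError, TOut> onFailure) =>
        IsSuccess ? onSuccess(_value!) : onFailure(_error!);
}
```

A caller can then branch with `result.Match(outcome => ..., error => ...)`, which keeps the expected failure visible at the call site instead of hiding it behind a bool.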


2. Different kinds of failure

One of the biggest design improvements a senior engineer brings is the ability to distinguish failure types instead of flattening them.

Unexpected technical failures

These are failures the normal caller is not expected to handle as part of regular business flow.

Examples:

  • null reference because of a bug
  • corrupted internal state
  • invalid cast
  • race condition causing impossible state
  • driver crash
  • unmanaged SDK access violation wrapped by adapter boundary
  • disk subsystem throwing unexpected IO exception outside known behavior

These are usually exception territory.

They represent either:

  • a programming error
  • an infrastructure fault outside the normal business contract
  • a system integrity issue

Expected business or domain failures

These are failures that are normal outcomes in the domain.

Examples:

  • recipe is invalid
  • machine is in wrong state for command
  • command rejected because door is open
  • inspection cannot start because wafer is not loaded
  • workflow step cannot proceed because preconditions are not satisfied

These are often better modeled as results, not exceptions.

They are not “surprising system breakage.” They are expected possibilities.

Validation errors

Validation errors deserve their own category because they are usually not runtime faults at all. They are input-quality problems.

Examples:

  • threshold is out of allowed range
  • required calibration file missing from recipe
  • scan area overlaps invalid region
  • operator entered negative exposure time
  • recipe references camera profile that does not exist

Validation is often multi-error by nature. A good validation model can return all issues together rather than failing on the first one.

Recoverable vs unrecoverable failures

This distinction matters operationally.

Recoverable:

  • transient network issue
  • temporary file lock
  • telemetry stream dropped but can reconnect
  • command timeout where retry is safe
  • image thumbnail save failed, core image save succeeded

Unrecoverable:

  • safety state violation
  • machine axis control unavailable
  • corrupted results package
  • invariant broken inside workflow engine
  • persistent camera initialization failure required for run correctness

The design question is not only “did it fail?” but also “what is the safe next action?”
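
One way to make "what is the safe next action?" explicit is to write the decision down as a small policy function. The sketch below is only an illustration: `FailureKind`, `NextAction`, `FailurePolicy`, and the retry limit of 3 are assumed names and values, and a real system would derive them from its own safety analysis.

```csharp
using System;

// Hypothetical classification of failures, loosely following the categories above.
public enum FailureKind
{
    Validation,
    DomainRejection,
    TransientTechnical,
    OptionalStepFailed,
    SafetyViolation,
    InvariantBroken
}

// Hypothetical set of responses the workflow can take.
public enum NextAction
{
    ContinueWithWarning,
    AskOperatorToCorrect,
    RetryAutomatically,
    StopRunSafely,
    FailFast
}

public static class FailurePolicy
{
    // Maps a failure kind to the next safe action.
    // maxAttempts is an assumed default; tune it per operation.
    public static NextAction Decide(FailureKind kind, int attemptsSoFar, int maxAttempts = 3) =>
        kind switch
        {
            FailureKind.Validation => NextAction.AskOperatorToCorrect,
            FailureKind.DomainRejection => NextAction.AskOperatorToCorrect,
            FailureKind.TransientTechnical when attemptsSoFar < maxAttempts
                => NextAction.RetryAutomatically,
            // Retries exhausted: a transient fault is no longer treated as transient.
            FailureKind.TransientTechnical => NextAction.StopRunSafely,
            FailureKind.OptionalStepFailed => NextAction.ContinueWithWarning,
            FailureKind.SafetyViolation => NextAction.StopRunSafely,
            FailureKind.InvariantBroken => NextAction.FailFast,
            _ => throw new ArgumentOutOfRangeException(nameof(kind), kind, "Unknown failure kind.")
        };
}
```

Keeping this mapping in one place means a change in retry policy does not require hunting through catch blocks.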

Warnings vs hard failures

Warnings mean the operation is still considered complete enough to proceed, but something important should be recorded or shown.

Examples:

  • inspection completed but some thumbnails were not generated
  • data archived locally but cloud upload deferred
  • optional telemetry unavailable
  • recipe used fallback calibration

Warnings are very important in industrial systems because many workflows should continue in degraded mode instead of failing completely.

Partial success

This is one of the most under-modeled areas.

Examples:

  • 98 images saved, 2 thumbnails failed
  • inspection completed, but one auxiliary statistics export failed
  • report generated, but one optional annotation layer missing
  • wafer scanned, but one region had low-confidence classification and was flagged

If your system models everything as either success or failure, you lose important reality. Production systems often produce mixed outcomes.

Why treating all of these the same leads to poor design

If all of them become exceptions:

  • business logic gets buried in catch blocks
  • logs become noisy
  • expected operator actions look like crashes
  • retries become inconsistent
  • UI messages become technical
  • workflows become fragile

If all of them become bool:

  • callers lose meaning
  • support cannot diagnose quickly
  • monitoring becomes weak
  • partial success disappears
  • warnings get lost
  • unsafe conditions may be ignored

The whole point of good failure modeling is to preserve meaning.


3. Real problems in a WPF desktop app controlling a wafer inspection machine

This kind of system is exactly where weak failure contracts cause chaos.

Imagine a WPF application controlling a wafer inspection machine. It coordinates:

  • operator UI
  • recipe loading and validation
  • machine state
  • motion control commands
  • camera/image acquisition
  • processing pipeline
  • result storage
  • alarms/logging
  • long-running workflows

Now look at what can go wrong.

Machine command may fail for very different reasons

A StartInspectionAsync operation may fail because:

  • machine is disconnected
  • machine is not initialized
  • machine is already running
  • machine is in alarm state
  • safety door is open
  • vendor SDK timed out
  • SDK threw native exception
  • cancellation requested
  • command acknowledged but completion event never arrived

These have different meanings.

The UI should not show the same message for all of them.

The workflow should not respond the same way either. “Machine not homed” is an operator-correctable domain issue. “Access violation in vendor camera SDK” is an engineering-level technical fault.

Workflow may partially complete

A run may successfully inspect wafers but fail to:

  • save some preview images
  • upload telemetry
  • archive some raw debug traces
  • enrich secondary analytics
  • write optional audit attachment

That should not necessarily invalidate the whole run.

If your workflow engine only supports “success” or “throw,” it becomes too brittle.

UI needs operator-friendly messages while logs retain technical details

The operator needs something like:

Camera not available. Check connection and power, then retry.

The log needs something like:

VendorCameraException HRESULT=0x8007001F during InitializeCamera on CameraAdapter.InitializeAsync. Serial=CAM-07. NativeCode=DeviceBusy. DriverVersion=5.2.13.

Those are different views of the same incident. Good error modeling supports both.

Data pipeline may continue with degraded behavior

Suppose thumbnail generation fails because the GPU helper process crashes. The main inspection results are still valid and the machine can keep running. The workflow should continue, record a warning, and surface that the result package is degraded.

This is much better than either:

  • crashing the whole run, or
  • silently hiding the issue

Some failures should stop the run immediately

Examples:

  • stage lost synchronization
  • axis controller reported unsafe motion state
  • core result data cannot be persisted
  • recipe integrity invalidates measurement correctness
  • emergency stop triggered

These are fail-fast conditions.

Others should only mark warnings

Examples:

  • preview generation failed
  • optional telemetry unavailable
  • thumbnail save retried and still failed
  • background diagnostics export failed

These should not automatically stop production.

This is why failure contracts must be explicit. In this domain, failure is part of the workflow model.


4. Exceptions vs result-style modeling

This is where teams often become dogmatic.

Some teams say: “exceptions are bad; use Result everywhere.” Others say: “C# already has exceptions; use them for everything.”

Both extremes are usually wrong.

When exceptions are the right tool

Exceptions are the right tool when something happened that the caller is not expected to model as a routine branch.

Good examples:

  • programming bugs
  • invariant violations
  • impossible states
  • unexpected SDK crashes
  • serialization bug
  • null where contract guaranteed non-null
  • infrastructure fault outside the normal operation contract

Examples:

csharp
public async Task<InspectionPlan> BuildPlanAsync(Recipe recipe, CancellationToken ct)
{
    if (recipe is null) throw new ArgumentNullException(nameof(recipe));

    var machineConfig = await _configProvider.GetCurrentAsync(ct);
    if (machineConfig.AxisCount <= 0)
        throw new InvalidOperationException("Machine configuration is invalid.");

    // ...
}

This is fine. These are not “expected operator outcomes.”

When a Result-style return model is better

A result model is better when the caller should explicitly handle a known, normal possibility.

Examples:

  • validation failure
  • command rejection because state is wrong
  • business rule not satisfied
  • workflow step skipped
  • partial completion with warnings
  • optional operation failed but system can continue

Example:

csharp
public Task<Result<Unit, RecipeValidationError[]>> ValidateAsync(
    InspectionRecipe recipe,
    CancellationToken ct);

That tells the caller: validation issues are expected, and you should handle them explicitly.

Why expected failures are often better as return values

Because they are part of the contract.

If the machine can validly reject a command because it is not in the correct state, that is not exceptional. It is a normal branch. Modeling it as an exception often pushes domain logic into technical control flow.

Example:

csharp
var result = await _machineService.StartInspectionAsync(recipe, ct);

if (result.IsFailure)
{
    switch (result.Error.ErrorCode) // the enum member of MachineCommandError, not the string Code
    {
        case MachineCommandErrorCode.InvalidState:
            ShowOperatorMessage(result.Error.OperatorMessage);
            return;
        case MachineCommandErrorCode.MachineDisconnected:
            ShowReconnectPrompt();
            return;
        default:
            // escalate or fallback
            break;
    }
}

This is clearer than catching five different custom exceptions for normal machine rejection cases.

Why unexpected failures are often better as exceptions

Because they should travel fast, preserve stack information, and signal abnormal conditions clearly.

If the internal workflow engine reaches an impossible state, returning Result.Fail("unexpected") often hides the severity.

Example:

csharp
if (!_stateMachine.CanTransition(currentState, trigger))
    throw new InvalidOperationException(
        $"Invalid workflow transition from {currentState} using {trigger}.");

That is a system defect, not an expected domain outcome.

Trade-offs

Exceptions:

  • good for abnormal faults
  • preserve stack traces
  • integrate naturally with async/await
  • bad when overused for routine outcomes
  • can make expected failures invisible in signatures

Results:

  • make expected failure explicit
  • improve contract clarity
  • good for validation, domain rules, partial success
  • can become verbose
  • can lead to “plumbing fatigue” if overused everywhere

Experienced engineers do not choose one universal rule. They choose based on whether the failure is part of normal operation.


5. Result pattern in practice

A result pattern is simply a structured way to return outcome information without using exceptions for normal branches.

That sounds simple, but the important part is what you put into the result model.

A weak result model:

csharp
public sealed class Result
{
    public bool Success { get; init; }
    public string? Error { get; init; }
}

This is not enough for production systems.

A stronger model usually needs:

  • success/failure state
  • error code/category
  • operator-friendly message
  • technical details or metadata
  • warning collection
  • partial success support
  • maybe a typed success payload

Here is a realistic base model.

csharp
public enum ErrorCategory
{
    Validation,
    Domain,
    Technical,
    Timeout,
    Connectivity,
    Safety,
    Concurrency,
    Unexpected
}

public sealed record ErrorDetail(
    string Code,
    string Message,
    ErrorCategory Category,
    string? OperatorMessage = null,
    IReadOnlyDictionary<string, object?>? Metadata = null);

public sealed class Result
{
    private Result(bool isSuccess, IReadOnlyList<ErrorDetail> errors, IReadOnlyList<ErrorDetail> warnings)
    {
        IsSuccess = isSuccess;
        Errors = errors;
        Warnings = warnings;
    }

    public bool IsSuccess { get; }
    public bool IsFailure => !IsSuccess;
    public IReadOnlyList<ErrorDetail> Errors { get; }
    public IReadOnlyList<ErrorDetail> Warnings { get; }

    public static Result Success(params ErrorDetail[] warnings) =>
        new(true, Array.Empty<ErrorDetail>(), warnings);

    public static Result Failure(params ErrorDetail[] errors) =>
        new(false, errors, Array.Empty<ErrorDetail>());
}

public sealed class Result<T>
{
    private Result(bool isSuccess, T? value, IReadOnlyList<ErrorDetail> errors, IReadOnlyList<ErrorDetail> warnings)
    {
        IsSuccess = isSuccess;
        Value = value;
        Errors = errors;
        Warnings = warnings;
    }

    public bool IsSuccess { get; }
    public bool IsFailure => !IsSuccess;
    public T? Value { get; }
    public IReadOnlyList<ErrorDetail> Errors { get; }
    public IReadOnlyList<ErrorDetail> Warnings { get; }

    public static Result<T> Success(T value, params ErrorDetail[] warnings) =>
        new(true, value, Array.Empty<ErrorDetail>(), warnings);

    public static Result<T> Failure(params ErrorDetail[] errors) =>
        new(false, default, errors, Array.Empty<ErrorDetail>());
}

That is still compact, but much more usable.

Example: ValidationResult

Validation often returns multiple problems.

csharp
public sealed record ValidationIssue(
    string Code,
    string Field,
    string Message,
    string? SuggestedFix = null);

public sealed class ValidationResult
{
    private ValidationResult(IReadOnlyList<ValidationIssue> issues)
    {
        Issues = issues;
    }

    public IReadOnlyList<ValidationIssue> Issues { get; }
    public bool IsValid => Issues.Count == 0;

    public static ValidationResult Valid() => new(Array.Empty<ValidationIssue>());

    public static ValidationResult Invalid(params ValidationIssue[] issues) => new(issues);
}

Usage:

csharp
public ValidationResult ValidateRecipe(InspectionRecipe recipe)
{
    var issues = new List<ValidationIssue>();

    if (recipe.ExposureTimeMs <= 0)
        issues.Add(new("Recipe.Exposure.Invalid", "ExposureTimeMs", "Exposure time must be greater than zero."));

    if (string.IsNullOrWhiteSpace(recipe.CameraProfile))
        issues.Add(new("Recipe.CameraProfile.Missing", "CameraProfile", "Camera profile is required."));

    if (recipe.ScanRegions.Count == 0)
        issues.Add(new("Recipe.ScanRegions.Empty", "ScanRegions", "At least one scan region is required."));

    return issues.Count == 0
        ? ValidationResult.Valid()
        : ValidationResult.Invalid(issues.ToArray());
}

Example: StartInspectionResult

Sometimes a dedicated outcome type is even clearer than a generic result.

csharp
public enum StartInspectionStatus
{
    Started,
    Rejected,
    Warning
}

public sealed record StartInspectionOutcome(
    StartInspectionStatus Status,
    string? RunId,
    IReadOnlyList<ErrorDetail> Warnings);

public sealed record MachineCommandError(
    string Code,
    string Message,
    string OperatorMessage,
    bool Retryable,
    bool SafeToRetry,
    MachineCommandErrorCode ErrorCode);

public enum MachineCommandErrorCode
{
    InvalidState,
    MachineDisconnected,
    Timeout,
    AlarmActive,
    SafetyInterlock,
    RecipeInvalid
}

Service contract:

csharp
Task<Result<StartInspectionOutcome>> StartInspectionAsync(
    InspectionRecipe recipe,
    CancellationToken ct);

Implementation sketch:

csharp
public async Task<Result<StartInspectionOutcome>> StartInspectionAsync(
    InspectionRecipe recipe,
    CancellationToken ct)
{
    var validation = _recipeValidator.ValidateRecipe(recipe);
    if (!validation.IsValid)
    {
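        return Result<StartInspectionOutcome>.Failure(
            new ErrorDetail(
                "Recipe.Invalid",
                "Recipe validation failed.",
                ErrorCategory.Validation,
                "Recipe is invalid. Review highlighted fields.",
                // Keep the individual issues so the UI can highlight the affected fields.
                new Dictionary<string, object?> { ["Issues"] = validation.Issues }));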
    }

    if (!_machineState.CanStartInspection)
    {
        return Result<StartInspectionOutcome>.Failure(
            new ErrorDetail(
                "Machine.InvalidState",
                $"Machine state '{_machineState.Current}' does not allow StartInspection.",
                ErrorCategory.Domain,
                "Machine is not ready to start inspection.",
                new Dictionary<string, object?> { ["MachineState"] = _machineState.Current }));
    }

    try
    {
        var runId = await _machineAdapter.StartInspectionAsync(recipe, ct);
        return Result<StartInspectionOutcome>.Success(
            new StartInspectionOutcome(StartInspectionStatus.Started, runId, Array.Empty<ErrorDetail>()));
    }
    catch (OperationCanceledException)
    {
        throw;
    }
    catch (TimeoutException ex)
    {
        return Result<StartInspectionOutcome>.Failure(
            new ErrorDetail(
                "Machine.Start.Timeout",
                ex.Message,
                ErrorCategory.Timeout,
                "Machine did not respond in time. Retry after checking connection."));
    }
}

Example: SaveImageResult with warnings and partial success

csharp
public sealed record SaveImageOutcome(
    string ImageId,
    string MainPath,
    string? ThumbnailPath,
    bool ThumbnailSaved);

public async Task<Result<SaveImageOutcome>> SaveImageAsync(
    CapturedImage image,
    CancellationToken ct)
{
    var warnings = new List<ErrorDetail>();

    string mainPath;
    try
    {
        mainPath = await _imageStore.SaveMainImageAsync(image, ct);
    }
    catch (IOException ex)
    {
        return Result<SaveImageOutcome>.Failure(
            new ErrorDetail(
                "ImageSave.Main.Failed",
                ex.Message,
                ErrorCategory.Technical,
                "Failed to save image data.",
                new Dictionary<string, object?> { ["ImageId"] = image.Id }));
    }

    string? thumbnailPath = null;
    bool thumbnailSaved = false;

    try
    {
        thumbnailPath = await _imageStore.SaveThumbnailAsync(image, ct);
        thumbnailSaved = true;
    }
    catch (Exception ex) when (ex is not OperationCanceledException) // let cancellation propagate
    {
        warnings.Add(new ErrorDetail(
            "ImageSave.Thumbnail.Failed",
            ex.Message,
            ErrorCategory.Technical,
            "Preview thumbnail could not be generated.",
            new Dictionary<string, object?> { ["ImageId"] = image.Id }));
    }

    return Result<SaveImageOutcome>.Success(
        new SaveImageOutcome(image.Id, mainPath, thumbnailPath, thumbnailSaved),
        warnings.ToArray());
}

This is a realistic production pattern: core operation succeeded, but some secondary work degraded.

Example: WorkflowStepResult

csharp
public enum WorkflowStepStatus
{
    Completed,
    Skipped,
    Failed,
    CompletedWithWarnings
}

public sealed record WorkflowStepResult(
    string StepName,
    WorkflowStepStatus Status,
    IReadOnlyList<ErrorDetail> Errors,
    IReadOnlyList<ErrorDetail> Warnings,
    TimeSpan Duration);

This is much better than bool ExecuteStep() for workflow orchestration.
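
As a rough sketch of how an orchestrator might produce such a result, the helper below times a step and derives the status from what the step reports. `StepOutcome` and `WorkflowStepRunner` are hypothetical names, `ErrorDetail` is trimmed to three fields to keep the example self-contained, and the `Skipped` status is left to the caller.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Trimmed-down ErrorDetail so the sketch compiles on its own.
public sealed record ErrorDetail(string Code, string Message, string? OperatorMessage = null);

public enum WorkflowStepStatus { Completed, Skipped, Failed, CompletedWithWarnings }

public sealed record WorkflowStepResult(
    string StepName,
    WorkflowStepStatus Status,
    IReadOnlyList<ErrorDetail> Errors,
    IReadOnlyList<ErrorDetail> Warnings,
    TimeSpan Duration);

// Hypothetical: what a step body reports back to the runner.
public sealed record StepOutcome(
    IReadOnlyList<ErrorDetail> Errors,
    IReadOnlyList<ErrorDetail> Warnings);

public static class WorkflowStepRunner
{
    public static async Task<WorkflowStepResult> RunAsync(
        string stepName,
        Func<CancellationToken, Task<StepOutcome>> body,
        CancellationToken ct)
    {
        var stopwatch = Stopwatch.StartNew();

        // Unexpected exceptions and cancellation are deliberately not caught here;
        // only expected outcomes are turned into a WorkflowStepResult.
        var outcome = await body(ct);
        stopwatch.Stop();

        var status =
            outcome.Errors.Count > 0 ? WorkflowStepStatus.Failed
            : outcome.Warnings.Count > 0 ? WorkflowStepStatus.CompletedWithWarnings
            : WorkflowStepStatus.Completed;

        return new WorkflowStepResult(
            stepName, status, outcome.Errors, outcome.Warnings, stopwatch.Elapsed);
    }
}
```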


6. Domain errors vs technical errors

This separation is crucial.

A machine operator does not care about a native HRESULT or driver stack location. A support engineer does. A developer cares even more.

If you mix these levels, you get one of two bad outcomes:

  • users see meaningless technical messages
  • logs lose the technical context needed for diagnosis

Example: vendor SDK throws native exception

Suppose the camera SDK throws this:

csharp
VendorCameraException: DeviceOpen failed. Error 0x889A0001. NodeMap unavailable.

That should not leak directly to the operator UI.

At the machine adapter boundary, translate it into a domain-relevant or application-relevant fault.

csharp
public async Task<Result<CameraSession>> OpenCameraAsync(CancellationToken ct)
{
    try
    {
        var handle = await _sdk.OpenAsync(ct);
        return Result<CameraSession>.Success(new CameraSession(handle));
    }
    catch (VendorCameraException ex) when (ex.Code == VendorCameraErrorCodes.DeviceBusy)
    {
        _logger.LogWarning(ex,
            "Camera open failed because device is busy. CameraId={CameraId}", _cameraId);

        return Result<CameraSession>.Failure(
            new ErrorDetail(
                "Camera.Unavailable",
                ex.Message,
                ErrorCategory.Connectivity,
                "Camera is not available. Check connection and whether another process is using it.",
                new Dictionary<string, object?>
                {
                    ["CameraId"] = _cameraId,
                    ["VendorCode"] = ex.Code
                }));
    }
}

Why this separation matters

Because different layers need different language.

Infrastructure layer:

  • precise technical details
  • exception types
  • error codes from external dependency

Application layer:

  • meaningful categories
  • retryability
  • operator-safe wording
  • business impact

UI layer:

  • operator action guidance
  • severity
  • recoverability
  • maybe translated/localized message

Logs and telemetry:

  • full technical context
  • correlation id
  • adapter name
  • step name
  • external code
  • stack trace if applicable

Good systems preserve all of these without confusing them.
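
To illustrate the UI side of that split, here is a minimal sketch of a presenter that turns an `ErrorDetail` into something safe to show an operator. `OperatorSeverity`, `OperatorNotification`, `OperatorPresenter`, and the generic fallback text are assumptions for the example. The technical `Message` is never used as a fallback because it may contain vendor codes.

```csharp
#nullable enable
using System.Collections.Generic;

public enum ErrorCategory
{
    Validation, Domain, Technical, Timeout, Connectivity, Safety, Concurrency, Unexpected
}

public sealed record ErrorDetail(
    string Code,
    string Message,
    ErrorCategory Category,
    string? OperatorMessage = null,
    IReadOnlyDictionary<string, object?>? Metadata = null);

// Hypothetical UI-facing types.
public enum OperatorSeverity { Warning, Error, Alarm }

public sealed record OperatorNotification(string Code, string Text, OperatorSeverity Severity);

public static class OperatorPresenter
{
    private const string GenericText =
        "An unexpected problem occurred. Contact engineering support.";

    public static OperatorNotification ToNotification(ErrorDetail error)
    {
        // Assumed mapping: safety issues raise an alarm, operator-correctable
        // issues are warnings, everything else is shown as an error.
        var severity = error.Category switch
        {
            ErrorCategory.Safety => OperatorSeverity.Alarm,
            ErrorCategory.Validation or ErrorCategory.Domain => OperatorSeverity.Warning,
            _ => OperatorSeverity.Error
        };

        // Only the operator-safe wording reaches the UI; the technical
        // message stays in logs and diagnostics.
        return new OperatorNotification(
            error.Code,
            error.OperatorMessage ?? GenericText,
            severity);
    }
}
```

The stable `Code` travels with the notification so an operator can quote it to support, while the full technical context stays in the log entry with the same code.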


7. Partial failure and degraded operation

Real systems rarely fail in perfectly binary ways.

Modeling partial success

Suppose inspection finishes, core measurements are valid, but 7 thumbnails fail to save due to disk pressure. You need a model that can say:

  • workflow completed
  • result set is valid
  • some non-critical artifacts are missing
  • warnings should be visible
  • support should be able to trace exactly what degraded

That is not bool.

A realistic aggregate result

csharp
public sealed record InspectionCompletionResult(
    string RunId,
    bool CoreResultsSaved,
    int ImagesCaptured,
    int MainImagesSaved,
    int ThumbnailsSaved,
    bool TelemetryUploaded,
    IReadOnlyList<ErrorDetail> Warnings,
    IReadOnlyList<ErrorDetail> Errors)
{
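    // Core results saved and nothing was reported as an error.
    public bool IsSuccess => CoreResultsSaved && Errors.Count == 0;
    // Fully successful, but with warnings worth surfacing.
    public bool IsCompletedWithWarnings => IsSuccess && Warnings.Count > 0;
    // Core results saved, but at least one non-critical part failed.
    public bool IsPartialSuccess => CoreResultsSaved && Errors.Count > 0;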
}

This kind of result tells the truth better.

Workflows that continue with warnings

A good orchestrator should know which failures are non-fatal.

csharp
foreach (var image in capturedImages)
{
    var saveResult = await _imageSaver.SaveImageAsync(image, ct);

    if (saveResult.IsFailure)
    {
        if (IsCritical(saveResult.Errors))
        {
            return AbortRun("Critical image persistence failure.", saveResult.Errors);
        }

        warnings.AddRange(saveResult.Errors);
        continue;
    }

    warnings.AddRange(saveResult.Warnings);
}

This is explicit. It is readable. It matches operational reality.
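
The loop above leaves `IsCritical` undefined. One possible shape, sketched under the assumption that errors are critical by default and only an explicit allow-list of codes is tolerated, is shown below. The `Criticality` class and the allow-list contents are illustrative.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public enum ErrorCategory
{
    Validation, Domain, Technical, Timeout, Connectivity, Safety, Concurrency, Unexpected
}

// Trimmed-down ErrorDetail so the sketch compiles on its own.
public sealed record ErrorDetail(string Code, string Message, ErrorCategory Category);

public static class Criticality
{
    // Assumed allow-list: codes the run is permitted to survive.
    private static readonly HashSet<string> NonCriticalCodes = new(StringComparer.Ordinal)
    {
        "ImageSave.Thumbnail.Failed"
    };

    // Critical unless proven otherwise: unknown codes and any safety-category
    // error stop the run. A new, unclassified error therefore fails safe.
    public static bool IsCritical(IReadOnlyList<ErrorDetail> errors) =>
        errors.Any(e =>
            e.Category == ErrorCategory.Safety || !NonCriticalCodes.Contains(e.Code));
}
```

Defaulting to "critical" is a design choice: it trades occasional unnecessary stops for never silently continuing past an error nobody has classified yet.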

Degraded modes

Sometimes the system should enter degraded mode intentionally.

Examples:

  • telemetry stream unavailable → continue without live dashboard
  • optional analytics engine offline → continue with core inspection only
  • secondary image annotation service down → continue and mark post-processing incomplete

Model this as a first-class state, not an accidental afterthought.

csharp
public enum OperationMode
{
    Full,
    Degraded,
    SafeStop
}

Then your workflow state can include mode and reasons.
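
A minimal sketch of such a state, assuming an immutable record that only ever escalates (Full to Degraded to SafeStop) and accumulates the reasons, might look like this. `WorkflowState`, `Degrade`, and `RequestSafeStop` are hypothetical names.

```csharp
using System.Collections.Immutable;

public enum OperationMode
{
    Full,
    Degraded,
    SafeStop
}

// Hypothetical immutable workflow state carrying the mode and why it changed.
public sealed record WorkflowState(OperationMode Mode, ImmutableList<string> Reasons)
{
    public static WorkflowState Initial { get; } =
        new(OperationMode.Full, ImmutableList<string>.Empty);

    // Degrading never downgrades a SafeStop; it only records the extra reason.
    public WorkflowState Degrade(string reason) =>
        Mode == OperationMode.SafeStop
            ? this with { Reasons = Reasons.Add(reason) }
            : new WorkflowState(OperationMode.Degraded, Reasons.Add(reason));

    public WorkflowState RequestSafeStop(string reason) =>
        new(OperationMode.SafeStop, Reasons.Add(reason));
}
```

Usage could be as simple as `state = state.Degrade("Telemetry stream unavailable");`, after which the UI and the run report can both read `state.Mode` and `state.Reasons`.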

Collecting multiple errors instead of failing immediately

Validation is the obvious example, but workflows also benefit from aggregation in the right places.

For example, during shutdown:

  • motion stop failed on axis A
  • telemetry flush failed
  • one result file remained locked

You may want to collect all issues, not stop after the first, because shutdown diagnostics matter.

The key is to aggregate where it improves operator action or supportability, and fail fast where safety or correctness requires it.
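
For the shutdown case, aggregation can be sketched as a loop that runs every step and records each failure instead of stopping at the first. `ShutdownSequence` and the `Shutdown.<step>.Failed` code format are assumptions for illustration. Catching `Exception` broadly is deliberate here, because during shutdown the goal is to attempt every step.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Trimmed-down ErrorDetail so the sketch compiles on its own.
public sealed record ErrorDetail(string Code, string Message);

public static class ShutdownSequence
{
    // Runs every step in order and returns all failures, never throwing
    // for an individual step.
    public static async Task<IReadOnlyList<ErrorDetail>> RunAllAsync(
        IEnumerable<(string Name, Func<Task> Action)> steps)
    {
        var errors = new List<ErrorDetail>();

        foreach (var (name, action) in steps)
        {
            try
            {
                await action();
            }
            catch (Exception ex)
            {
                // Record and keep going so later steps still get their chance.
                errors.Add(new ErrorDetail($"Shutdown.{name}.Failed", ex.Message));
            }
        }

        return errors;
    }
}
```

The caller gets the complete list and can log it as one shutdown report instead of learning about the problems one restart at a time.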


8. Error propagation across layers

This is where mature design shows up.

A failure should not flow unchanged through every layer. It should be translated at boundaries so each layer sees what it needs.

A practical layered view

Infrastructure layer

Deals with:

  • SDK exceptions
  • IO exceptions
  • database failures
  • socket issues
  • serialization failures

This layer should catch low-level exceptions only when it can add context or translate them meaningfully. Otherwise it should let them bubble up.

Machine adapter layer

Converts vendor-specific behavior into machine-relevant outcomes.

It knows that:

  • vendor code 1042 means device busy
  • timeout during command acknowledgement likely means lost communication
  • certain faults are retryable
  • certain faults should map to operator-facing machine states

Application/workflow layer

Decides:

  • stop run or continue
  • warning or hard failure
  • retry or escalate
  • update run state
  • surface alarm
  • record audit event

UI/ViewModel layer

Decides:

  • what the operator sees
  • whether to disable buttons
  • whether to show modal error, banner, status line, or alarm panel
  • whether technical details are hidden or available in diagnostics screen

Example flow

Vendor SDK throws timeout:

csharp
TimeoutException("Command ACK not received within 1500 ms")

Machine adapter translates:

csharp
new ErrorDetail(
    "Machine.Command.Timeout",
    "Command ACK not received within 1500 ms",
    ErrorCategory.Timeout,
    "Machine did not respond in time.")

Workflow layer evaluates:

  • if this is a homing command, stop workflow
  • if this is optional light-control refresh, retry once and continue if safe

UI layer shows:

  • “Machine did not respond. Check machine connection and retry.”

Logging layer records:

  • command name
  • timeout duration
  • machine state
  • correlation id
  • adapter operation
  • original exception

That is good boundary translation.

Where to catch, where to rethrow, where to convert

Catch when:

  • you can add important context
  • you can translate to a meaningful domain/application result
  • you can decide recovery or fallback
  • you can preserve safety

Rethrow or allow bubbling when:

  • the layer cannot handle it meaningfully
  • it represents a programming/invariant failure
  • translation would only hide important technical truth

Convert to result when:

  • the caller is expected to branch on it
  • the failure is part of normal operation
  • you want explicit contract-driven handling


9. Async, pipelines, and failure contracts

Async code makes failure easier to lose.

That is one of the biggest real-world dangers.

Failure modeling in async methods

Async methods already use exceptions naturally through Task. That is useful, but also dangerous because it tempts teams to use exceptions for everything.

A good rule:

  • expected outcomes: model explicitly in the returned result
  • unexpected faults: let exceptions fault the task

Example:

csharp
public async Task<Result<InspectionFrame>> TryAcquireFrameAsync(CancellationToken ct)
{
    if (!_machineState.IsAcquisitionReady)
    {
        return Result<InspectionFrame>.Failure(
            new ErrorDetail(
                "Acquire.InvalidState",
                "Machine is not ready for acquisition.",
                ErrorCategory.Domain,
                "Machine is not ready to capture images."));
    }

    var frame = await _camera.AcquireAsync(ct); // unexpected SDK failures can still throw
    return Result<InspectionFrame>.Success(frame);
}

Failure propagation in Task-based flows

In orchestrations, it must be clear which failures:

  • fault the whole task
  • return as expected results
  • are aggregated into warnings
  • trigger cancellation of sibling operations

Without that clarity, async flows become impossible to reason about.
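
As one illustration of the last bullet, a fault in one operation can cancel its siblings through a linked `CancellationTokenSource`. This is a sketch under assumptions: `SiblingTasks` and `RunAllOrCancelAsync` are hypothetical names, and the policy "any fault cancels everything" is only one of several reasonable choices.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class SiblingTasks
{
    // Runs all operations concurrently. If one faults, the others are asked
    // to stop via a shared token, and the fault is rethrown to the caller.
    public static async Task RunAllOrCancelAsync(
        IReadOnlyList<Func<CancellationToken, Task>> operations,
        CancellationToken ct)
    {
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(ct);

        async Task Guarded(Func<CancellationToken, Task> operation)
        {
            try
            {
                await operation(linked.Token);
            }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                linked.Cancel(); // signal sibling operations to stop
                throw;
            }
        }

        // ToList starts every operation before any of them is awaited.
        var tasks = operations.Select(Guarded).ToList();

        // Awaiting WhenAll rethrows the first fault; the remaining exceptions
        // stay available on the returned task's Exception property.
        await Task.WhenAll(tasks);
    }
}
```

Siblings only stop if they actually honor the token they are given, so each operation still needs to pass it down to its own awaits.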

Channel/pipeline stage failure handling

In streaming pipelines, failures often happen inside background consumers:

  • image save loop
  • analytics stage
  • telemetry stage
  • result export stage

If a background loop throws and nobody observes it, the system may continue in a broken state silently.

That is extremely dangerous.

Example: hidden background save loop failure

Bad:

csharp
_ = Task.Run(async () =>
{
    await foreach (var image in _channel.Reader.ReadAllAsync(ct))
    {
        await _imageSaver.SaveImageAsync(image, ct);
    }
});

If that task faults, the exception is stored in the discarded Task and nobody awaits it, so the workflow may never know.

Better:

csharp
private Task? _saveLoopTask;

public void StartSaveLoop(CancellationToken ct)
{
    _saveLoopTask = RunSaveLoopAsync(ct);
}

private async Task RunSaveLoopAsync(CancellationToken ct)
{
    await foreach (var image in _channel.Reader.ReadAllAsync(ct))
    {
        var result = await _imageSaver.SaveImageAsync(image, ct);

        if (result.IsFailure)
        {
            if (IsCritical(result.Errors))
            {
                throw new SavePipelineCriticalException(result.Errors);
            }

            _warningSink.Report(result.Errors);
        }

        if (result.Warnings.Count > 0)
            _warningSink.Report(result.Warnings);
    }
}

Then the orchestrator explicitly observes the task:

csharp
try
{
    await _saveLoopTask!;
}
catch (SavePipelineCriticalException ex)
{
    _logger.LogError(ex, "Image save loop failed critically.");
    await StopRunSafelyAsync();
    throw;
}

Partial pipeline failure vs full workflow cancellation

This is a key design choice.

Examples:

  • thumbnail stage fails → continue
  • core result persistence fails → cancel run
  • monitoring loop throws → maybe switch to degraded mode and raise alarm
  • PLC heartbeat lost → stop run immediately

The orchestrator should own this policy. Individual stages should report what failed; they should not each decide on their own whether the run continues.
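
A minimal sketch of such a policy, kept in one place that the orchestrator consults. The stage names, the `StageFailureAction` enum, and the fail-safe default for unknown stages are assumptions made for this illustration.

```csharp
public enum StageFailureAction
{
    Continue,
    EnterDegradedMode,
    CancelRun,
    StopImmediately,
}

public static class StageFailurePolicy
{
    // Stage names are hypothetical labels for the four examples in the text.
    public static StageFailureAction Decide(string stage) => stage switch
    {
        "Thumbnail" => StageFailureAction.Continue,
        "CoreResultPersistence" => StageFailureAction.CancelRun,
        "Monitoring" => StageFailureAction.EnterDegradedMode,
        "PlcHeartbeat" => StageFailureAction.StopImmediately,
        // Assumption: an unknown stage fails safe by cancelling the run.
        _ => StageFailureAction.CancelRun,
    };
}
```

Keeping the mapping in one table makes the stop/continue rules reviewable without reading every stage.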

Why hidden async failures are dangerous

The UI may still show “running” while part of the system is dead.

That is worse than a visible crash. It is silent corruption of operational truth.
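
One hedged way to keep the displayed state honest is to attach a small supervisor to every background task. `BackgroundSupervisor` and its `onFault` callback are hypothetical names for this sketch; what the callback does (flip a view-model state, raise an alarm) is left to the application.

```csharp
using System;
using System.Threading.Tasks;

public static class BackgroundSupervisor
{
    // Awaits a background task so that a fault is always observed and reported.
    // The returned task completes when the background task ends.
    public static async Task SuperviseAsync(Task background, Action<Exception> onFault)
    {
        try
        {
            await background;
        }
        catch (OperationCanceledException)
        {
            // Cancellation is treated as a normal shutdown, not a fault.
        }
        catch (Exception ex)
        {
            onFault(ex); // e.g. change the displayed state from "running" to "faulted"
        }
    }
}
```

Usage could look like `_ = BackgroundSupervisor.SuperviseAsync(_saveLoopTask, ex => ...)`. Because `await` captures the current synchronization context by default, starting the supervisor from the UI thread means the callback also runs there.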


10. How we use this in .NET in practice

Here is the practical model I would recommend for many production .NET desktop systems.

Use exceptions for truly exceptional or unexpected failures

Examples:

  • code bugs
  • invariant violations
  • unexpected third-party crashes
  • impossible state transitions
  • misuse of internal API contracts

Use Result-like types for expected outcomes

Examples:

  • validation
  • command rejection
  • unavailable-but-handled machine state
  • partial success
  • warnings
  • skip/continue decisions

Map low-level faults into meaningful application errors

At boundaries, convert technical exceptions into application-relevant or domain-relevant outcomes where appropriate.
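
A minimal sketch of such a boundary for file storage. `AppError`, `StorageFaultMapper`, `ImageFileWriter`, and the codes `ImageSave.AccessDenied` and `ImageSave.PathMissing` are hypothetical; only `ImageSave.Main.Failed` appears elsewhere in this article.

```csharp
using System;
using System.IO;

// Hypothetical application-level error carrying a stable code, an
// operator-safe message, and the original exception for logs.
public sealed record AppError(string Code, string OperatorMessage, Exception? Cause = null);

public static class StorageFaultMapper
{
    // Returns null for exceptions that are not expected storage faults,
    // so the caller lets them propagate as genuine exceptions.
    public static AppError? TryMap(Exception ex) => ex switch
    {
        UnauthorizedAccessException =>
            new AppError("ImageSave.AccessDenied", "The image folder cannot be written to.", ex),
        // More specific IOException subtypes must come before IOException itself.
        DirectoryNotFoundException =>
            new AppError("ImageSave.PathMissing", "The image folder was not found.", ex),
        IOException =>
            new AppError("ImageSave.Main.Failed", "The image could not be saved.", ex),
        _ => null,
    };
}

public static class ImageFileWriter
{
    // Returns null on success, or a mapped error for expected storage faults.
    public static AppError? Write(string path, byte[] data)
    {
        try
        {
            File.WriteAllBytes(path, data);
            return null;
        }
        catch (Exception ex) when (StorageFaultMapper.TryMap(ex) is { } mapped)
        {
            return mapped;
        }
    }
}
```

The exception filter means anything the mapper does not recognize, such as a null path, still propagates as an exception.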

Carry error codes and safe messages

Use stable codes. Codes matter for support, automation, and observability.

Examples:

  • Recipe.Invalid
  • Machine.InvalidState
  • Machine.Command.Timeout
  • Camera.Unavailable
  • ImageSave.Thumbnail.Failed
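
A simple way to keep such codes stable is to define them once as constants. The `ErrorCodes` class and its `Area` helper are a hypothetical sketch; the code strings are the ones listed above.

```csharp
using System;

public static class ErrorCodes
{
    public const string RecipeInvalid = "Recipe.Invalid";
    public const string MachineInvalidState = "Machine.InvalidState";
    public const string MachineCommandTimeout = "Machine.Command.Timeout";
    public const string CameraUnavailable = "Camera.Unavailable";
    public const string ImageSaveThumbnailFailed = "ImageSave.Thumbnail.Failed";

    // The first dot-separated segment names the subsystem, which is useful
    // for grouping by area ("Machine", "Camera", ...).
    public static string Area(string code)
    {
        var dot = code.IndexOf('.');
        return dot < 0 ? code : code.Substring(0, dot);
    }
}
```

Constants also make a renamed code a compile-time change rather than a silent string mismatch.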

Design APIs with explicit failure contracts

Some practical examples:

csharp
public interface IRecipeValidator
{
    ValidationResult Validate(InspectionRecipe recipe);
}

public interface IMachineCommandService
{
    Task<Result<StartInspectionOutcome>> StartInspectionAsync(
        InspectionRecipe recipe,
        CancellationToken ct);

    Task<Result> StopInspectionAsync(CancellationToken ct);
}

public interface IImagePersistenceService
{
    Task<Result<SaveImageOutcome>> SaveImageAsync(
        CapturedImage image,
        CancellationToken ct);
}

public interface IWorkflowStep
{
    Task<WorkflowStepResult> ExecuteAsync(WorkflowContext context, CancellationToken ct);
}

This is much clearer than a mix of bool, void, random exceptions, and event-based side channels.

A more complete example

csharp
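using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

public sealed class InspectionWorkflow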
{
    private readonly IMachineCommandService _machine;
    private readonly IImagePersistenceService _imagePersistence;
    private readonly ILogger<InspectionWorkflow> _logger;

    public InspectionWorkflow(
        IMachineCommandService machine,
        IImagePersistenceService imagePersistence,
        ILogger<InspectionWorkflow> logger)
    {
        _machine = machine;
        _imagePersistence = imagePersistence;
        _logger = logger;
    }

    public async Task<Result<InspectionCompletionResult>> RunAsync(
        InspectionRecipe recipe,
        IReadOnlyList<CapturedImage> images,
        CancellationToken ct)
    {
        var warnings = new List<ErrorDetail>();
        var errors = new List<ErrorDetail>();
        var mainImagesSaved = 0;
        var thumbnailsSaved = 0;

        var startResult = await _machine.StartInspectionAsync(recipe, ct);
        if (startResult.IsFailure)
        {
            return Result<InspectionCompletionResult>.Failure(startResult.Errors.ToArray());
        }

        foreach (var image in images)
        {
            var saveResult = await _imagePersistence.SaveImageAsync(image, ct);

            if (saveResult.IsFailure)
            {
                if (saveResult.Errors.Any(e => e.Code == "ImageSave.Main.Failed"))
                {
                    // Core data could not be persisted: stop and leave the remaining images unprocessed.
                    errors.AddRange(saveResult.Errors);
                    break;
                }

                // Any other failure is treated as non-critical and downgraded to a warning.
                warnings.AddRange(saveResult.Errors);
                continue;
            }

            // Count per image; subtracting list sizes would miscount after an early break.
            mainImagesSaved++;
            if (!saveResult.Warnings.Any(w => w.Code == "ImageSave.Thumbnail.Failed"))
                thumbnailsSaved++;

            warnings.AddRange(saveResult.Warnings);
        }

        if (errors.Count > 0)
            return Result<InspectionCompletionResult>.Failure(errors.ToArray());

        var completion = new InspectionCompletionResult(
            RunId: startResult.Value!.RunId!,
            CoreResultsSaved: true,
            ImagesCaptured: images.Count,
            MainImagesSaved: mainImagesSaved,
            ThumbnailsSaved: thumbnailsSaved,
            TelemetryUploaded: true, // placeholder: this example has no telemetry step
            Warnings: warnings,
            Errors: errors);

        return Result<InspectionCompletionResult>.Success(completion, warnings.ToArray());
    }
}

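This example separates three cases that toy code usually merges: a rejected start, a critical save failure that ends the run, and non-critical failures that are carried forward as warnings.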


11. Common mistakes

These mistakes are very common because teams often evolve failure handling reactively.

Throwing exceptions for normal validation failures

Why it happens:

  • easy at first
  • framework culture sometimes encourages exception-first style
  • teams do not distinguish expected vs unexpected failure

What it causes:

  • noisy logs
  • harder control flow
  • awkward UI handling
  • validation treated like a crash path

Validation is usually not exceptional. It is an expected branch.

Swallowing exceptions and returning generic “failed”

Why it happens:

  • fear of crashes
  • rushed defensive coding
  • desire to “keep system running”

Example:

csharp
catch (Exception)
{
    return false;
}

What it causes:

  • lost diagnostic detail
  • impossible support investigation
  • hidden severity
  • meaningless UI messaging

This is one of the worst patterns in production code.

Returning bool with no reason

Why it happens:

  • simplicity
  • legacy habits
  • trying to avoid complexity

What it causes:

  • opaque contracts
  • caller confusion
  • side-channel dependency
  • inconsistent user messaging
  • poor observability

bool is often too weak for important operations.

Leaking low-level technical errors directly to UI

Why it happens:

  • shortcut from catch block to message box
  • no translation layer
  • internal exception text used as user communication

What it causes:

  • operator confusion
  • frightening or meaningless messages
  • poor UX
  • accidental exposure of irrelevant technical detail
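
A translation layer can be as small as a lookup from stable error code to operator wording. The `OperatorMessages` class and the message texts below are hypothetical; the two codes come from the examples earlier in this article.

```csharp
using System.Collections.Generic;

public static class OperatorMessages
{
    // Illustrative wording only; real texts would be reviewed with operators.
    private static readonly Dictionary<string, string> ByCode = new()
    {
        ["Camera.Unavailable"] = "Camera is not responding. Check the cable and power, then retry.",
        ["Machine.InvalidState"] = "Machine is not ready. Home the machine before starting.",
    };

    // Unknown codes get a generic message that still carries the code, so
    // support can look it up. Raw exception text is never shown.
    public static string For(string errorCode) =>
        ByCode.TryGetValue(errorCode, out var text)
            ? text
            : $"An unexpected problem occurred (code {errorCode}). Please contact support.";
}
```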

Mixing domain failures and technical failures together

Why it happens:

  • no error taxonomy
  • ad hoc custom exceptions
  • lack of architecture ownership

What it causes:

  • retry logic becomes unreliable
  • workflow stop/continue decisions become inconsistent
  • hard-to-read code

Inconsistent result styles across the codebase

Examples:

  • some methods throw
  • some return bool
  • some return null
  • some return tuples
  • some use custom Result
  • some signal failures by events

This is chaos.

A large system needs conventions.

Hiding async/background failures

Why it happens:

  • fire-and-forget tasks
  • unobserved pipeline consumer faults
  • background services without supervision

What it causes:

  • silent data loss
  • stale UI state
  • partial dead system behavior
  • very long debugging cycles

No structured error codes or categories

Why it happens:

  • teams rely on free-form strings
  • support needs were not considered up front

What it causes:

  • impossible reporting aggregation
  • weak support playbooks
  • no stable contract for telemetry or alarm routing

12. Trade-offs

There is no free design.

Simplicity vs explicitness

A bool return is simple. A rich result is explicit.

The right choice depends on the importance and variability of failure.

For critical machine/workflow operations, explicitness usually wins.

Exception-based flow vs Result-based flow

Exception flow is concise for rare abnormal cases. Result flow is clearer for expected branching.

Use each where it fits. Overusing either creates pain.

Rich error models vs complexity

A very rich model can become heavy:

  • too many types
  • too much wrapping
  • too much mapping code

A very weak model becomes ambiguous.

Experienced engineers aim for enough structure to preserve meaning, but not so much that every method becomes ceremony.

Preserving detail vs keeping APIs readable

Not every result needs twenty fields.

Keep the surface contract readable:

  • code
  • category
  • operator-safe message
  • maybe metadata
  • warnings/errors collection where needed

Deeper technical detail can stay in logs or diagnostic context.

Consistency across system vs local optimization

One team may want a custom result per feature. Another wants one universal result type. Both extremes can be awkward.

Usually a good compromise is:

  • a shared base error/result model
  • specialized result payloads where the domain needs them
  • documented conventions for when to throw vs return result

That gives consistency without flattening everything.


13. Designing good failure contracts

A good failure contract tells the truth about the operation.

What makes a good failure contract

It should tell the caller:

  • what a successful outcome looks like
  • what expected failures look like
  • whether partial success exists
  • whether warnings can be returned
  • whether exceptions still represent unexpected faults
  • what the caller is expected to do

How callers know what to expect

The contract should be visible in the signature and naming.

Bad:

csharp
Task<bool> ExecuteAsync();

Better:

csharp
Task<WorkflowStepResult> ExecuteAsync(WorkflowContext context, CancellationToken ct);

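The second signature names its inputs and returns a structured result, so the contract is visible without reading the implementation.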

When the API should force explicit handling

If the failure is business-significant, the API should make it hard to ignore.

Validation is a good example.

csharp
ValidationResult Validate(InspectionRecipe recipe);

The return type puts validity and the list of issues directly in front of the caller instead of hiding them in a thrown exception.
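
A minimal sketch of what that can look like. The shapes of `ValidationResult`, `ValidationIssue`, and `InspectionRecipe` here are simplified assumptions for illustration, and the codes `Recipe.Name.Missing` and `Recipe.Threshold.OutOfRange` are hypothetical.

```csharp
using System.Collections.Generic;

public sealed record ValidationIssue(string Code, string Message);

public sealed record ValidationResult(IReadOnlyList<ValidationIssue> Issues)
{
    public bool IsValid => Issues.Count == 0;
}

// Simplified recipe with only the fields this sketch validates.
public sealed record InspectionRecipe(string Name, double Threshold);

public sealed class RecipeValidator
{
    // Invalid input is an expected branch, so it is returned, not thrown.
    public ValidationResult Validate(InspectionRecipe recipe)
    {
        var issues = new List<ValidationIssue>();

        if (string.IsNullOrWhiteSpace(recipe.Name))
            issues.Add(new("Recipe.Name.Missing", "Recipe name is required."));

        if (double.IsNaN(recipe.Threshold) || recipe.Threshold is < 0.0 or > 1.0)
            issues.Add(new("Recipe.Threshold.OutOfRange", "Threshold must be between 0 and 1."));

        return new ValidationResult(issues);
    }
}
```

Collecting all issues in one pass lets the UI show every problem at once instead of one exception at a time.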

A machine start command is another example.

csharp
Task<Result<StartInspectionOutcome>> StartInspectionAsync(
    InspectionRecipe recipe,
    CancellationToken ct);

The caller can no longer pretend that failure is just “maybe false.”

Examples

Machine service API

csharp
public interface IMachineService
{
    Task<Result<MachineStatusSnapshot>> GetStatusAsync(CancellationToken ct);
    Task<Result<StartInspectionOutcome>> StartInspectionAsync(InspectionRecipe recipe, CancellationToken ct);
    Task<Result> StopAsync(CancellationToken ct);
}

Expected command failures are explicit.

Workflow step API

csharp
public interface IWorkflowStep
{
    Task<WorkflowStepResult> ExecuteAsync(WorkflowContext context, CancellationToken ct);
}

Better than exceptions for every skip, warning, or rejection.

Validation API

csharp
public interface IRecipeValidator
{
    ValidationResult Validate(InspectionRecipe recipe);
}

Do not make validation throw for normal invalid input.

Save pipeline API

csharp
public interface IImageSaver
{
    Task<Result<SaveImageOutcome>> SaveAsync(CapturedImage image, CancellationToken ct);
}

This supports partial success and warnings naturally.


14. Debugging and observability of result/field failures

One of the operational benefits of structured failure modeling is faster diagnosis.

How structured failure models help production debugging

If errors are modeled with codes and categories, support can quickly answer:

  • what happened
  • where it happened
  • how often it happens
  • which failures are operator errors vs system faults
  • which are retryable vs fatal

That is much better than searching logs for text fragments.

How error codes and categories improve supportability

Examples:

  • Recipe.Invalid
  • Machine.InvalidState
  • Machine.Command.Timeout
  • ImageSave.Thumbnail.Failed
  • Camera.Unavailable

These codes can drive:

  • dashboards
  • alert thresholds
  • support runbooks
  • alarm classifications
  • trend analysis

Correlating operator-visible failures with logs and telemetry

A strong pattern is to include:

  • operation/run id
  • error code
  • step name
  • machine id
  • recipe id
  • timestamp
  • correlation id

The operator may see:

Inspection completed with warnings. Preview images missing.

The log/telemetry can show:

  • RunId: R-20260417-1422
  • WarningCode: ImageSave.Thumbnail.Failed
  • Count: 12
  • Node: IPC-03
  • DiskFreeMB: 142
  • Step: ThumbnailGenerator

That correlation reduces mean time to recovery (MTTR) because support can go from symptom to cause much faster.
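
A hedged sketch of carrying those fields together. `FailureEvent` and `FailureEventFormatter` are hypothetical names, and the sketch assumes .NET 5 or later so that `System.Text.Json` can serialize records.

```csharp
using System;
using System.Text.Json;

// One record holds the correlation fields listed above.
public sealed record FailureEvent(
    string RunId,
    string Code,
    string Step,
    string MachineId,
    string RecipeId,
    string CorrelationId,
    DateTimeOffset Timestamp);

public static class FailureEventFormatter
{
    // Structured line for logs and telemetry: every field is queryable.
    public static string ToLogLine(FailureEvent e) => JsonSerializer.Serialize(e);

    // Operator text: the safe message plus a reference that support can search for.
    public static string ToOperatorText(FailureEvent e, string safeMessage) =>
        $"{safeMessage} (ref {e.CorrelationId})";
}
```

The operator sees only the safe message and a reference; the same reference leads support to the full structured entry.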

How experienced engineers use failure modeling to reduce MTTR

They design errors not just for code correctness, but for operations.

They ask:

  • can support distinguish operator misuse from machine fault?
  • can we count and trend this failure?
  • can we tell whether degraded mode was entered?
  • can we correlate the UI message to a stable code?
  • can we tell whether retries happened and why?

That is mature engineering.


15. Senior engineer mental model

This is the main shift.

A senior engineer stops thinking of failure as “the thing that happens in catch.” They think of failure as part of the domain model.

In a real system:

  • some negative outcomes are normal
  • some are warnings
  • some are partial success
  • some require recovery
  • some require operator action
  • some should stop immediately
  • some indicate a system defect

Those differences should appear in the design.

How experienced engineers think about expected outcomes vs true exceptions

They ask:

  • Is this an expected possibility in normal operation?
  • Does the caller need to branch on it?
  • Is it safe or unsafe?
  • Should it be visible in the signature?
  • Is it a business/domain outcome or a technical fault?
  • Does partial success matter here?
  • What should the operator see?
  • What should logs and telemetry retain?

If the outcome is expected in normal operation and the caller needs to branch on it, it usually belongs in a result model. If not, it may belong in exception flow.

How they keep error handling consistent across a large codebase

They establish conventions such as:

  • exceptions for unexpected/programming/invariant failures
  • result types for expected domain/application outcomes
  • validation returns structured validation result
  • adapter boundaries translate external faults into application-level errors
  • operator-facing messages never come directly from raw exceptions
  • background task failures must be observed and surfaced
  • error codes are stable and structured

This consistency matters more than theoretical purity.

How they design APIs that are honest about failure

Honest APIs tell callers what can happen.

Dishonest APIs hide important outcomes behind:

  • bool
  • null
  • generic exception
  • side effects
  • logs only

Good APIs make failure behavior discoverable and predictable.

How they keep failure understandable for both developers and operators

They separate views:

  • technical detail for developers and logs
  • meaningful categories for workflows
  • safe, actionable wording for operators

That separation is one of the marks of production-grade design.


A practical recommendation for interview-level thinking

If I had to summarize the whole topic into one practical rule set for a senior/principal interview, I would say this:

Use exceptions for things that are truly abnormal, unexpected, or represent bugs or broken assumptions.

Use Result-style models for things that are expected parts of business, workflow, machine state, validation, partial success, and warnings.

Translate low-level technical faults into meaningful application/domain failures at boundaries.

Design failure contracts explicitly, especially in long-running workflows, machine commands, and background pipelines.

Preserve enough detail for logs, telemetry, support, and diagnosis, but keep operator messages clear, safe, and actionable.

And above all: do not treat all failures as the same kind of thing. In real systems, failure has shape. Good engineers model that shape clearly.
